final revision
incorporate feedback into our final revision. 4 [R1]: "I don't exactly see if small batch vs large batch captures this phenomenon; if yes should say explicitly."
We thank the reviewers for the detailed and insightful reviews. As the reviews noted, our work 1) introduces "novel Smith et al. [2017] make an explicit connection between small vs. large batch "A small discussion on if the phenomenon has been observed for different datasets/tasks with different optimizers" The phenomenon may not be true for other optimizers such as Adam, though. "concept of 'memorizable and generalizable', though intuitive, is sketchy and not formally explained ... authors We acknowledge that the terms "memorizable" and "generalizable" are potentially confusing. We will revise our terminology to clarify this distinction. By "inherently noisy", we refer to the fact that high noise in the datapoints will necessitate a larger sample complexity.
reviewers' questions below and will incorporate feedback into the final revision
We thank the reviewers for the detailed and insightful reviews. As the reviewers noted, our work 1) contributes to "a Thank you for the valuable feedback on this section; we will incorporate this in our next revision. The intuition for the proof of Theorem 3.3 is that the optimization problem is convex over the space of probability By weak regularization, we refer to the fact that λ → 0 for our Theorem 4.1 to hold. The difficulty with ReLU networks is that if the gradient flow pushes neurons towards 0, issues of differentiability arise. One potential approach to circumvent this issue is arguing that with correct initialization, the iterates will never reach 0. This is an interesting direction for future work and we thank the reviewer for this suggestion.
strongly-convex-concave minimax problems first, which we will add in the final revision
We thank all the reviewers for their constructive comments. The intuition behind Algorithm 1 stems from a "conceptual" version of DIAG (also specified in Algorithm 1, Step 4), which is inspired by the conceptual version of Mirror-Prox (MP) (cf. We agree with, and will include, the reviewer's comment that the non-smoothness of We will devote more space to explaining the DIAG algorithm and discussing more related work. We will add a precise justification (which was omitted due to lack of space) in the next revision. We discuss important ones below.